[57] ABSTRACT

Parallel processing is performed by determining sequential
ordering of tasks for processing, assigning priorities to the
tasks available on the basis of the sequential ordering,
selecting a number of tasks greater than a total number of
available parallel processing elements from all available
tasks having the highest priorities, partitioning the selected
tasks into a number of groups equal to the available number
of parallel processing elements, and executing the tasks in
the parallel processing elements.
METHODS AND MEANS FOR SCHEDULING
PARALLEL PROCESSORS

FIELD OF THE INVENTION

This invention relates to methods and means for scheduling
tasks performed by parallel processors, and particularly to
concurrent execution of operations in a number of functional
units under the control of an assignment manager.

BACKGROUND OF THE INVENTION

Traditional data processing systems execute a sequence of
instructions one at a time. To increase the execution speed of
computers, multiple processors are used to perform parallel
processing of primitive operations, or tasks, of an algorithm.
Such parallelism often requires substantially more space
(memory) than sequential operations.

Many parallel programming languages allow for parallel
processing in separate functional units. The goal of these
languages is to have the user expose the full parallelism and
have the language implementation schedule the parallelism
onto processors. Costs of such operations can be measured
abstractly in terms of the total number of operations
executed by the program, i.e. the "work", and the length of
the longest sequence of dependencies, i.e. the "depth".
Performance anomalies in such arrangements are common.
Heuristics used in the implementation often fail. Such
systems do not necessarily offer good performance, both in
terms of time and space.

An object of the invention is to improve such methods and
means.

SUMMARY OF THE INVENTION

According to aspects of the invention such ends are
achieved by determining sequential ordering of tasks for
processing, assigning priorities to the tasks available on the
basis of the sequential ordering, selecting a number of tasks
greater than a total number of available processing elements
from all available tasks having the highest priorities,
partitioning the selected tasks into a number of groups equal to
the available number of parallel processing elements, and
processing the tasks in the parallel processing elements.

These and other aspects of the invention are pointed out
in the claims. Other objects and advantages of the invention
will become evident when read in light of the accompanying
drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a system embodying features
of the invention.

FIG. 2 is a block diagram illustrating details of the
processing element array of FIG. 1 and embodying features
of the invention.

FIG. 2A is a block diagram of another embodiment of the
processing array in FIG. 1.

FIG. 3 is a block diagram illustrating details of processing
elements in FIGS. 2 and 2A.

FIG. 4 is a block diagram of the assignment manager in
FIG. 1.

FIG. 5 is a flow chart of the operations of FIGS. 1 to 4.

FIG. 6 shows details of a step in FIG. 5.

FIG. 6A is a flow chart illustrating another embodiment of
the flow chart in FIG. 6.

FIG. 7 is a more detailed flow chart of the operation of the
assignment manager.

FIG. 7A is a flow chart showing another embodiment of
the process in a portion of FIG. 7.

FIG. 8 is a block diagram of a portion of FIG. 4.

FIGS. 9 and 9A are flow charts of another embodiment of
the operation of the assignment manager.

FIG. 9B is a flow chart showing another embodiment of
the process in a portion of FIGS. 9 and 9A.

FIGS. 10 and 11 are diagrams illustrating nodes occurring
in the operational theory of the invention.

DETAILED DESCRIPTION OF PREFERRED
EMBODIMENTS

FIG. 1 illustrates an embodiment of the invention as a
block diagram. Here, a preprocessor PP1 translates
(compiles) an incoming program into a suitable form for
execution, for example, machine-level instructions.
According to an embodiment of the invention, the preprocessor PP1
is optional and may be omitted. The incoming program may
be any sequential program that takes the form of some
programming language that reveals the tasks to be
performed by parallel processing but not the assignment (or
mapping) of tasks to processors. The program may be such
that the set of tasks and the relationships between them are
determined by (dependent on) the program's input data, and
so are revealed only during the parallel processing of the
program on its input data.

An assignment manager AM1 determines tasks available
for scheduling and assigns a subset of these tasks to a system
SY1 containing processing elements PE1 and a router RT1
shown in FIG. 2. Specifically, the assignment manager AM1
supplies a set of available tasks to be executed by each
processing element PE1. For each processing element PE1
the router RT1 routes the set of tasks to be executed and
supplied by the assignment manager AM1 to a task buffer
(not shown) within each processing element PE1.

Each processing element PE1 in the system SY1 of
processing elements executes the instructions of the tasks in
its task buffer, and informs the assignment manager AM1
when tasks are completed. The assignment manager AM1
proceeds as long as there are more program tasks to be
executed and as long as the program is not completed.

The processing elements PE1 receive input data upon
which the tasks of the parallel program operate. The
processing elements PE1 then output program output data.

FIG. 2A illustrates another embodiment of the system
SY1 of FIG. 2. Here, the functions of the processing
elements PE1 are divided into computation elements CE and
memory elements ME. The router RT1 again routes the tasks
from the assignment manager AM1 to the processing
elements PE1 in the form of computation elements CE and
memory elements ME. Each computation element CE reads
and writes locations in any memory element ME (or
possibly only a subset of the memory elements ME) via the router
RT1.

FIG. 3 illustrates details of the processing elements PE1.
Here, a router interface RI1 connects to a task buffer TB1,
a processor PR1, and a memory ME1 all coupled to each
other. Tasks are placed in the task buffer TB1 by the
assignment manager AM1. Whenever the processor PR1 is
idle, it removes a task from the task buffer TB1, and executes
it.

A feedback exists from the processing elements PE1 to
the assignment manager AM1 regarding the completed
execution of tasks. According to one embodiment of the
invention such feedback occurs upon completion of a task or
set of tasks. The processing element PE1 then sends an
acknowledgement to the assignment manager AM1 via its
router RT1. According to another embodiment the
processing element PE1 places the acknowledgment in a separate
task buffer which can be read by the assignment manager.

The memory element ME1 contains the memory available
to the processor PR1. In the preferred embodiment a
processor reads and writes certain locations in the memory
elements ME1 residing in other processing elements PE1 by
communicating via the router RT1. The task buffer TB1 can
reside within the memory element ME1 or form a separate
memory device.

Details of the assignment manager AM1 of FIG. 1 appear
in FIG. 4. Here a task queue TQ1 contains a set of tasks
available for scheduling (not necessarily all such tasks). A
task assigner TA1 removes tasks from the task queue TQ1
and assigns them to the system SY1 of processing elements
PE1 and supplies a set of zero or more tasks in the task buffer
TB1 for each processing element PE1.

A task queue and status buffers manager (TSM) BM1 adds
tasks to the task queue TQ1. The task queue and status
buffers manager BM1 uses the task queue TQ1 and status
buffers SB1 to determine tasks available for scheduling. The
status buffers SB1 include the necessary information on the
relationship between tasks, e.g., tasks that need to be
synchronized upon completion. The task queue and status buffers
manager BM1 uses the program and feedback information
obtained from the system SY1 of processing elements PE1
to update the task queue TQ1 and the status buffers SB1.

A task is "available" if it has no precedent that must be
accomplished before execution of that task. That is, some
tasks cannot be executed until one or more preceding tasks
have been completed. Such a task is said to have a precedent
restraint. Such a task becomes "available" upon completion
of all its preceding restraining tasks. Some tasks, at the
outset, have no precedents that require completion. Such
tasks are available at the start.
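
The availability rule can be illustrated with a short sketch. It assumes, purely for illustration, that the precedence relationships are held in a dictionary mapping each task to the set of tasks that must complete first; the names `available_tasks`, `predecessors` and `completed` are hypothetical and do not appear in the specification.

```python
def available_tasks(predecessors, completed):
    """Return the tasks that have not run yet and whose predecessors are all complete."""
    return {
        task
        for task, preds in predecessors.items()
        if task not in completed and preds <= completed
    }


# Hypothetical four-task program: "c" needs "a" and "b"; "d" needs "c".
predecessors = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}

print(sorted(available_tasks(predecessors, completed=set())))       # ['a', 'b']
print(sorted(available_tasks(predecessors, completed={"a", "b"})))  # ['c']
```

Tasks "a" and "b" have no precedents and so are available at the start; "c" becomes available only once both have completed.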

Sequential programs intended for use with a single
processor usually employ a sequential scheduler that designates
each task of a program with a code or characterization that
identifies the ordering of the task in the sequence of
instructions. Thus each task has a designation identifying its order
in the schedule.

The invention utilizes the ordering of tasks in the
sequential scheduling to select a subset of the available tasks for
parallel processing. That is, the invention selects a subset of
available tasks for parallel processing by assigning higher
priorities to the earlier available tasks in the sequential
schedule.

FIG. 5 is a flow chart of the operation of the system in
FIGS. 1 to 4. Here, in step 504, the program is loaded into
the preprocessor PP1. In step 507, the preprocessor PP1
translates the program into a form suitable for the particular
elements in the system. The assignment manager AM1, in
step 510, determines the tasks available for scheduling and,
in step 514, assigns the tasks to processing elements as
shown in the flow charts of FIGS. 6 and 7. The processing
elements PE1 then execute the tasks in step 517 and the
assignment manager AM1, in step 520, asks whether the
program is complete. If the answer is yes, the assignment
manager AM1 stops the operation; if no, the assignment
manager returns to step 510.
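
The loop of steps 510 to 520 can be sketched as follows. This is a simplified illustration under stated assumptions, not the flow chart itself: execution is replaced by marking tasks complete, each round hands at most one task to each processing element, and alphabetical order stands in for the priority ordering. The function name `run_program` and the dictionary layout are hypothetical.

```python
def run_program(predecessors, num_elements):
    """Sketch of the FIG. 5 loop; each round gives at most one task to each element."""
    completed = set()
    rounds = []
    while len(completed) < len(predecessors):        # step 520: program complete?
        ready = sorted(                              # step 510: tasks available now
            task for task, preds in predecessors.items()
            if task not in completed and preds <= completed
        )
        if not ready:
            raise ValueError("cyclic dependencies: no task can become available")
        assigned = ready[:num_elements]              # step 514: assign (simplified)
        completed.update(assigned)                   # step 517: stand-in for execution
        rounds.append(assigned)
    return rounds


# Hypothetical program: "c" needs "a" and "b"; "d" needs "c".
predecessors = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}
print(run_program(predecessors, num_elements=2))  # [['a', 'b'], ['c'], ['d']]
```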

Details of step 514 appear in FIG. 6. Here, in step 604, the
assignment manager AM1 assigns priorities to the tasks
available for scheduling according to an ordering that is
determined by a particular sequential scheduler of all the
tasks in the program, pre-selected at the start of the method.
The scheduler is a known type such as a 1DFT (depth first
traversal) scheduler. Depth first traversal schedulers are
discussed below under "Theory". The sequential scheduler
serves not only for those tasks that are now available for
scheduling, but for all the tasks as determined by a
sequential execution of the program that is independent of the
parallel execution.
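
One way to obtain such an ordering is a stack-based depth first traversal of the task graph. The sketch below is an illustration only: it assumes the whole graph is known in advance and is given as a dictionary from each task to its list of children, whereas the specification allows the graph to unfold during execution. The names `dft_numbers` and `children` are hypothetical.

```python
def dft_numbers(children):
    """Number tasks in the order a stack-based depth first traversal schedules them."""
    pending = {node: 0 for node in children}       # count of unscheduled parents
    for kids in children.values():
        for kid in kids:
            pending[kid] += 1
    # Root tasks start on the stack; the first root listed ends up on top.
    stack = [n for n in reversed(list(children)) if pending[n] == 0]
    numbers = {}
    while stack:
        node = stack.pop()                         # schedule the top of the stack
        numbers[node] = len(numbers) + 1
        newly_ready = []
        for kid in children[node]:
            pending[kid] -= 1
            if pending[kid] == 0:
                newly_ready.append(kid)
        stack.extend(reversed(newly_ready))        # leftmost newly ready child on top
    return numbers


# Hypothetical fork-join graph: root forks to l and r, l continues to ll, both join.
children = {"root": ["l", "r"], "l": ["ll"], "r": ["join"], "ll": ["join"], "join": []}
print(dft_numbers(children))
# {'root': 1, 'l': 2, 'll': 3, 'r': 4, 'join': 5}
```

A lower number means an earlier position in the sequential schedule and therefore a higher priority.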

According to one embodiment of the invention, the
assignment manager AM1 includes a number of processors
which may operate in parallel. According to another
embodiment, the assignment manager AM1 utilizes the
processing elements PE1 to perform the parallel steps.

In step 607, the assignment manager AM1 selects some
number N of available tasks which have the highest assigned
priority, where N is typically, but not necessarily, more than
the number of processing elements and less than the
maximum possible available tasks.
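
A minimal sketch of this selection, assuming each task's priority is its position in the sequential schedule (a lower number means earlier and therefore higher priority). The names `select_tasks`, `available` and `priority` are hypothetical.

```python
import heapq


def select_tasks(available, priority, n):
    """Pick the n available tasks that come earliest in the sequential schedule."""
    return heapq.nsmallest(n, available, key=priority.__getitem__)


# Hypothetical priorities: position of each task in the sequential schedule.
priority = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
available = {"e", "b", "d", "c"}

print(select_tasks(available, priority, 3))  # ['b', 'c', 'd']
```

If fewer than `n` tasks are available, all of them are returned.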

In step 610, the assignment manager AM1 partitions the
N selected tasks into p groups of size approximately (N/p) each,
where p is the number of available processing elements PE1. In
step 614, the assignment manager AM1 assigns each group
to one of the processing elements PE1.
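
The partitioning of step 610 can be sketched as below. The use of contiguous blocks is an assumption made for illustration; the text only requires p groups of roughly N/p tasks each. The function name `partition` is hypothetical.

```python
def partition(selected, p):
    """Split the selected tasks into p groups whose sizes differ by at most one."""
    base, extra = divmod(len(selected), p)
    groups, start = [], 0
    for i in range(p):
        size = base + (1 if i < extra else 0)   # first `extra` groups get one more
        groups.append(selected[start:start + size])
        start += size
    return groups


print(partition(list("abcdefg"), 3))  # [['a', 'b', 'c'], ['d', 'e'], ['f', 'g']]
```

When fewer than p tasks are selected, some groups are empty, which matches a processing element receiving a set of zero tasks.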

According to an embodiment of the invention the
assignment manager AM1 includes a number of parallel
processors. The assignment manager AM1 then performs its
functions, in steps 604, 607, 610, and 614, in a small number
of parallel steps. Otherwise it performs its functions in
ordinary sequence.

Another embodiment of the invention serves to assure that
the number N of selected tasks is not so large as to take up
too much memory. For this purpose a limit L is placed on the
number N of tasks selected in step 607. The application of
this limit in step 607 appears in FIG. 6A. Here, in step 650,
the assignment manager AM1 designates a limit L on the
number N of selected tasks, on the basis of memory
available at any time for a group of tasks, and memory available
for the bookkeeping for this group of tasks. The value of L
can change with available memory. Alternatively, the limit L
is designated at the start of the program and entered in step
650. Step 654 asks whether the number M of available tasks is
equal to or greater than the limit L. If yes, step 657 sets N=L and
the process continues to step 610. If no, i.e., if M<L, step 670
sets N=M. The process then advances to step 610.
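
Steps 654 to 670 amount to taking the smaller of M and L. The sketch below shows that test together with one possible way of deriving L from free memory; the memory policy, the function names and the byte figures are hypothetical and are not taken from the specification.

```python
def choose_n(m, limit):
    """Steps 654 to 670: N = L when M >= L, otherwise N = M."""
    return limit if m >= limit else m


def limit_from_memory(free_bytes, bytes_per_task):
    """Hypothetical step 650 policy: as many tasks as fit in the free memory."""
    return free_bytes // bytes_per_task


limit = limit_from_memory(free_bytes=1000, bytes_per_task=64)
print(limit)                # 15
print(choose_n(40, limit))  # 15
print(choose_n(7, limit))   # 7
```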

FIG. 7 is a flow chart of another embodiment of the
invention and shows details of the operation of the
assignment manager AM1. In step 704, the task queue and buffers
manager BM1 of FIG. 4 in the assignment manager AM1
reads the initial program instructions to determine the set of
tasks that are ready at the start of the program. In step 707,
the task queue and buffers manager BM1 then initializes the
status buffers SB1 to keep a suitable record of these tasks.
In step 710, the task queue and buffers manager BM1
assigns priorities to the tasks in the buffers and places, in the
task queue TQ1, a number N of ready high priority tasks
from those residing in status buffers SB1, based on the
records. According to another embodiment of the invention
the assignment of priorities is performed by status buffers
SB1. In step 714, the task assigner TA1 removes the tasks
from the task queue TQ1 and supplies distinct tasks to
the task buffer TB1 for each processing element PE1.
According to still another embodiment of the invention, the
assignment of priorities occurs in the task assigner TA1. In
step 715, the processing elements PE1 then execute the
tasks.

Then, in step 717, the task queue and buffers manager
BM1 is informed by the system SY1 of processing elements
PE1 that all assigned tasks have completed. In step 720, the
task queue and buffers manager BM1 uses the program and
the feedback information to update the status buffers SB1
and include records of newly ready tasks. In step 724, the
task queue and buffers manager BM1 asks if all tasks have
been completed. If yes, the operation ends; if no, the
operation returns to step 710.

FIG. 7A is a flow chart of another embodiment of step
714. Here, the steps follow step 710. In step 750, the task
assigner TA1 removes the tasks from the task queue TQ1.
According to one embodiment of the invention, the tasks
have already been assigned priorities in step 710 by the
buffers manager BM1 or the task queue TQ1, and the task
assigner TA1 receives only the high priority tasks.
According to another embodiment the task assigner TA1
assigns the priorities. In step 754, the task assigner TA1
weights the tasks on the basis of complexity. In step 757, the
task assigner TA1 divides the selected tasks in groups among
the number of available processing elements PE1 on the
basis of the weighting, so that the total weighting and hence
complexity of each group is adapted to the ability of the
processing elements PE1 to handle processing at that time.
If the processing elements PE1 and their abilities are the
same, the task assigner TA1 divides the selected tasks into
groups which are weighted approximately the same among
the number of available processing elements PE1. In step
760, the task assigner TA1 supplies the groups of tasks to the
task buffer TB1 for each processing element PE1. The
process then returns to step 715.

FIG. 8 illustrates an embodiment of the status buffers SB1
in FIG. 4. Here a status buffer SB1 includes a ready task
multiway stack RS1 and task status support buffers TSSB1.
These are used in the embodiment of the operation of the
assignment manager AM1 shown in FIGS. 9 and 9A.

In step 904 of FIG. 9, the task queue and buffers manager
BM1 of FIG. 4 in the assignment manager AM1 reads the
initial program instructions to determine the set of tasks that
are ready at the start of the program. In step 907, the task
queue and buffers manager BM1 places these tasks in the
ready task multiway stack RS1 and suitable records in the
task status support buffers TSSB1.

In step 910, the task queue and buffers manager BM1
places the first N tasks residing in the ready task multiway
stack RS1 into the task queue TQ1. As part of the operation
of step 910, the task queue and buffers manager BM1
assigns priorities to the tasks in the buffers and places, in the
task queue TQ1, some number N of ready high priority tasks
from those residing in task status support buffers TSSB1 of
status buffers SB1, based on the records.

According to another embodiment of the invention the
assignment of priorities is performed by the buffers TSSB1
of status buffers SB1. In step 914, the task assigner TA1
removes the tasks from the task queue TQ1 and supplies N/p
distinct tasks to the task buffer TB1 for each processing
element PE1. According to still another embodiment of the
invention, the assignment of priorities occurs in the task
assigner TA1. In step 915, the processing elements PE1 then
execute the tasks.

In step 917 the task queue and buffers manager BM1 is
informed by the system SY1 of processing elements that all
assigned tasks have completed. The task queue and buffers
manager BM1, in step 920, uses the program and the
feedback information to suitably update the task status
support buffers TSSB1 and include records of newly
spawned tasks. The process then advances to step 950 in
FIG. 9A. There, the task queue and buffers manager BM1 uses
the task status support buffers TSSB1 information to determine
newly ready tasks. The task queue and buffers manager BM1, in
step 954, then updates the ready task multiway stack RS1 to
contain all ready tasks in a DFT (depth-first traversal) order.
Step 957 involves the question whether the process is
completed in that all steps have been performed. If no, the
process returns to step 910. If yes, the process stops.

FIG. 9B is a flow chart of another embodiment of step
914. Here, the steps follow step 910. In step 980, the task
assigner TA1 removes the tasks from the task queue TQ1.
According to one embodiment of the invention, the tasks
have already been assigned priorities in step 910 by the
buffers manager BM1 or the task queue TQ1, and the task
assigner TA1 receives only the high priority tasks.
According to another embodiment, the task assigner TA1 assigns
the priorities.

In step 984, the task assigner TA1 weights the tasks on the
basis of complexity. In step 987, the task assigner TA1
divides the selected tasks in groups among the number of
available processing elements PE1 on the basis of the
weighting, so that the total weighting and hence complexity
of each group is approximately the same. In step 990, the
task assigner TA1 supplies the groups of tasks to the task
buffer TB1 for each processing element PE1. The process
then returns to step 915.

According to other embodiments of FIGS. 1 to 4, the
assignment manager AM1 can have a sequential
(centralized) or parallel (distributed) implementation. A
parallel implementation is executed on the system SY1 of
processing elements PE1 or on a separate system. The
operation of status buffers manager BM1 and the task
assigner TA1 can be executed by the processor elements PE1
or by a separate set of parallel processors, and the task queue
TQ1 and the status buffers SB1 can be implemented in the
task queue and status buffers manager BM1 or in separate
memory devices. The elements of FIGS. 1 to 4 may be in the
form of discrete structures or may be processors or parts of
processors that perform the required functions.

The invention achieves reduced parallel-processing
memory requirements by selecting a subset of available
tasks for parallel processing and assigning higher priorities
to the earlier available tasks in the sequential schedule. The
process of the invention applies groups of tasks to the
parallel processing elements on the basis of their priorities.

When the process at any stage spawns new tasks, they
take the place in the schedule ordering of the parent tasks
that spawned them. According to another embodiment of the
invention, the ordering in the sequential schedule reserves
spots for spawned tasks, and the spawned tasks are placed in
those spots.

THEORY

The invention is based on the following theoretical
background. We specify universal implementations that help
assure performance bounds, both in terms of time and space
(i.e. memory). These are specified by placing upper bounds
on the running time and the space of the implementation as
a function of the work, depth and sequential space. We
formalize the notion of work, depth and space, by modeling
computations as directed acyclic graphs (DAGs) that may
unfold dynamically as the computation proceeds. DAGs
appear in the articles of R. D. Blumofe and C. E. Leiserson,
Space-efficient scheduling of multithreaded computations,
In Proc. 25th ACM Symp. on Theory of Computing, pages
362-371, May 1993, and of R. D. Blumofe and C. E.
Leiserson, Scheduling multithreaded computations by work
stealing, In Proc. 35th IEEE Symp. on Foundations of
Computer Science, pages 356-368, November 1994.

The nodes in the DAG represent unit-work tasks, and the
edges represent any ordering dependencies between the
tasks that must be respected by the implementation. This is
a very general model, modeling even programs (such as
parallel quicksort disclosed in J. JaJa, An Introduction to
Parallel Algorithms, Addison-Wesley, Reading, Mass.,
1992) whose task structure or data dependencies are
revealed only as the execution proceeds. The work of a
computation corresponds to the number of nodes in the
DAG, and the depth corresponds to the longest path in the
DAG. To account for memory usage, a weight is assigned to
each node of the DAG that represents the amount of memory
that node needs to allocate or deallocate. The sequential
space of a computation is then defined as the input space
plus the space required by the depth-first traversal (DFT) of
its DAG (the traversal taken by standard sequential
implementations).

FIG. 10 illustrates the task structure of a matrix
multiplication computation (for "n=4") represented as a directed
acyclic graph. Nodes NO1 represent unit-work tasks, and
edges ED1 (assumed to be directed downward in the figure)
represent control and/or data flow between the tasks. A
level-by-level schedule of this graph requires "Θ(n^3)" space
for program variables, in order to hold the n^3 intermediate
results required at the widest level of the graph. Moreover,
such a schedule may use "Θ(n^3)" space for task bookkeeping,
in order to keep track of tasks ready to be scheduled. Note
that the standard depth-first sequential schedule of this graph
uses only "Θ(n^2)" space, counting the space for the input and
output matrices.

Any parallel schedule that makes good use of the
processors will almost always schedule tasks in a different order
than the sequential implementation. This can result in an
increase in both the memory needed for task bookkeeping
(to keep track of perhaps a larger set of ready tasks at each
step) and the amount of memory allocated to program
variables (to hold a possibly larger set of variables that have
been allocated but not yet deallocated).

To achieve efficient schedules, a class of parallel
schedules that are provably efficient in both space and number of
steps, for any dynamically unfolding DAG, are first
identified. If a computation has work "W" and depth "D", and
takes "S1" sequential space, then a "p"-processor schedule
from this class offers the following advantages.

There are at most "W/p+D" steps in the schedule. This is
always within a factor of two of the best possible over all
schedules. For programs with sufficient parallelism (i.e.
"W/p>>D"), this is within a factor of "1+o(1)" of the best
possible. The computation uses only "S1+O(p·D)" space.
This includes space for program variables and for task
bookkeeping. Thus for programs with sufficient parallelism
(i.e. "S1/p>>D", recalling that "S1" is at least the size of the
input), this is within a factor of "1+o(1)" of the sequential
space. This contrasts with known bounds such as "S1·p" (see
F. W. Burton, Storage management in virtual tree machines,
IEEE Trans. on Computers, 37(3):321-328, 1988; the
aforementioned R. D. Blumofe and C. E. Leiserson articles; and
F. W. Burton and D. J. Simpson, Space efficient execution of
deterministic parallel programs, Manuscript, December
1994), which is a factor of "p" from the sequential space.

These bounds apply when individual tasks allocate at
most a constant amount of memory. When unit-work tasks
can allocate memory in arbitrary amounts, the same space
bound can be obtained using at most "(W+S1)/p+D" steps,
and in general there exists a trade-off between space and
steps. The space bound implies, for example, that the
memory required for matrix multiplication is "S1+O(p log
n)", which is within a factor of "1+o(1)" of the best possible
for all "p in o(n^2/log n)". The above bounds do not account
for overheads to implement the schedule.

A common approach (e.g. as described in the
aforementioned article R. D. Blumofe and C. E. Leiserson,
Space-efficient scheduling of multithreaded computations, In Proc.
25th ACM Symp. on Theory of Computing, pages 362-371,
May 1993) is that of greedily scheduling "p" independent
nodes of the DAG each step, if possible, where "p" is the
number of processors.

To obtain desired space bounds, attention is focused on
bounding the increase in memory for program variables,
since our solution suffices as well to bound the increase in
memory for task bookkeeping. Labeling individual nodes
with their memory requirements allows for more
fine-grained memory allocation than in previous models that
associate memory requirements with entire threads in the
computation, as mentioned in the aforementioned R. D.
Blumofe and C. E. Leiserson articles. Block memory
allocations, e.g. for arrays, are indicated by nodes whose
weight is the size of the block to be allocated.

The primary question is which (greedy) parallel
schedules, if any, have provably good space bounds on all
computation DAGs. A first key point defines a class of
parallel schedules that are based on given sequential
schedules, such that the sequential schedule dictates the
scheduling priorities to the parallel schedule.

The parallel schedule, although based on a given
sequential schedule, will almost always schedule nodes
out-of-order (i.e. prematurely) with respect to the sequential
schedule, in order to achieve the desired parallelism on each
step. A second key point is to focus on these "premature"
nodes, and show with a careful argument that the number of
premature nodes at any step of the "p"-processor schedule is
at most "p" times the depth, "D", of the DAG. This matches
a lower bound shown and justifies the use of parallel
schedules based on sequential schedules.

A third key point is to use this bound on the number of
premature nodes to bound the space requirements. At each
parallel step of the computation, each premature node may
require space beyond that needed for the sequential
schedule, due to additional allocations performed by these
nodes and in order to keep track of any additional nodes
ready to be scheduled. An approach is shown for ensuring
that the extra space for premature nodes is linear in their
number. Since there are at most "p·D" premature nodes, an
"S1+O(p·D)" space bound is obtained, where "S1" is the
space for the sequential schedule.

The above results apply to any parallel schedule based on
a sequential schedule. Since the standard sequential
schedule is a depth-first traversal (DFT) of the DAG, definition
and consideration are given to "p"-DFT schedules, a class of
parallel schedules based on sequential depth-first schedules.
There are a number of ways one might think to define a
parallel DFT; the definition shown has provably good
performance bounds. Note that this schedule, denoted a
"p"-DFT schedule, differs from a schedule that gives priority to
the deepest nodes at each step.

To obtain an efficient scheduling process a number of
definitions are required. The "p"-DFT class of parallel
schedules here defined are provably efficient in both time
and space relative to the standard sequential depth-first
traversal. A second main result is an efficient runtime
(online) scheduling algorithm for generating "p"-DFT
schedules, for languages with nested fine-grained
parallelism (i.e. languages that lead to series-parallel DAGs). We
show how the processor allocation and task synchronization
needed to implement "p"-DFT schedules can be performed
by the processors on-the-fly with minimal overhead. For a
computation with "W" work, "D" depth, and "S1" sequential
space, the resulting scheduled computation, including these
overheads, obtains the following bounds. We obtain
"O(W/p+D·log p)" time on an EREW PRAM, and the same time
bound, with high probability, on a hypercube, for
EREW-style computations (i.e. no concurrent reads or writes). On
the stronger CRCW PRAM, a variant of the algorithm
obtains "O(W/p+D·loglog p)" time worst case or
"O(W/p+D·log* p)" time with high probability, for CRCW-style
computations (i.e. concurrent reads and writes permitted at
no extra cost). These work-efficient schedules use only
"Theta(S1+D·p log p)" or better space. If tasks can allocate
memory in arbitrary amounts, the same bounds can be
obtained as long as the sequential space is at most linear in
the sequential running time (i.e. "S1 in O(W)").

These results apply to nearly all data-parallel languages,
both nested and not, as well as most languages that supply
fork-and-join style parallelism, even permitting arbitrary
fanout and arbitrary nesting.

First, we observe that for the class of languages we
consider, the task structure of a computation is a
dynamically unfolding series-parallel DAG, with certain nodes
having arbitrary fanout (called source nodes) and other,
corresponding, nodes having arbitrary fanin (called sink
nodes). Next, we show that for such DAGs, a simple
stack-based parallel scheduling algorithm yields a "p"-DFT
schedule, as well as specifying how the processors should be
allocated to the tasks. Thus a parallel schedule based on a
sequential one can be constructed on-the-fly without knowing
the sequential schedule. This algorithm avoids the slowdown
that would be required to explicitly construct a sequential
depth-first schedule.

In addition to the simple stack-based algorithm a modified

ogy to describe the computation DAGs and their schedules,
including the general model employed to measure the space
required to execute a program according to a given schedule
of the DAG. We use standard graph terminology (see, e.g.,
T. H. Cormen, C. E. Leiserson, and R. L. Rivest,
Introduction to Algorithms, McGraw-Hill, New York, N.Y., 1990).

Consider a directed acyclic graph "G". A "p"-traversal (or
"p"-schedule) of "G", for "p>=1", is a sequence of "tau>=1"
steps, where each step "i", "i=1, ..., tau", defines a set of
nodes "V_i" (that are visited, or scheduled at this step), such
that the following three properties hold: (i) each node
appears exactly once in the schedule; (ii) a node is scheduled
only after all its ancestors have been scheduled in previous
steps; and (iii) each step consists of at most "p" nodes. A
traversal (or schedule) of "G" is a "p"-traversal (or
"p"-schedule) of "G" for some "p".

Consider a traversal "T=V_1, ..., V_tau" of "G". A node "v" in
"G" is scheduled prior to a step "i" in "T" if "v" appears in
"V_1 cup ... cup V_{i-1}". An unscheduled node "v" is ready at
step "i" in "T" if all its ancestors (equivalently, all its
parents) are scheduled prior to step "i". The frontier of "G" at
step "i" in "T" is the set of all nodes scheduled prior to "i"
with an unscheduled child node. A greedy "p"-traversal (see
R. D. Blumofe and C. E. Leiserson, Space-efficient
scheduling of multithreaded computations, In Proc. 25th ACM
Symp. on Theory of Computing, pages 362-371, May
1993) is a "p"-traversal such that at each step "i": (i) if at least
"p" nodes are ready, then "|V_i|=p"; and (ii) if fewer than "p"
are ready, then "V_i" consists of all the ready nodes.

A depth-first 1-traversal (DFT or 1-DFT) is obtained by
maintaining a stack of ready nodes: the stack contains the
root nodes initially (in any order), and at each step, the top
node on the stack is popped from the stack and scheduled,
and then any newly ready nodes are pushed on the front of
the stack (in any order). The "i"th node in a DFT is said to
have DFT number "i".

A (single source and sink) series-parallel DAG is defined
inductively, as follows: The graph, "G0", consisting of a
single node (which is both its source and sink node) and no
edges is a series-parallel DAG. If "G1" and "G2" are

algorithm, using lazy stack allocation, is given that obtains series-parallel then &c graph obtained by adding to "G^ cup 

tfiis bound on its stack space. Moreover, given the unknown G^** a directed edge from tfie sink node of **Gi" to the source 

and varying levels of recursive nesting in nested parallelism 45 node of **Gj" b series-parallel. If '"G, G/, **k^ ^"\^ 

computations and the possibly large fanin at a node, there are series-parallel, then the graph obtained by adding to "G^ 

difficulties in efBdently identifying nodes that become avail- cup-s cup G^** a new source node, * V, with a directed edge 

able for schedu^ng. We denote this subproblcm as die from '^u" into the source nodes of "G^ G^", and a new 

task-synchronization problem for sink nodes. A next key sink node, with a directed edge from the sink nodes of 

point is an eflBcient algorithm for this task-synchronization 50 "G| G*" into is series-parallel. Thus, a node may 

problem- We use properties of **p"-DFT schedules on series- have indcgree or outdcgrec greater than 1. but not both. We 

parallel DAGs to argue the correctness of our algorithm and say that the source node, "u", is the lowest common source 

its data structures, as well as its resource bounds. node for any pair of nodes 'Sv in G " and *V in G/ such that 

Our stack-based scheduling algorithm pciforms a con- "i^***. 

stant number of paraUel prefix-sums operations and a con- 33 For paraUel schedules based on sequential ones, we define 

Slant number of EREW PRAM steps for each round of a -traversal. *T to be based on a l-traversaL 'T^". if. 

scheduling. To amortize these scheduling overheads, we at each step "i" of 'Tp", the "k^" earliest nodes in "Tt" that 

schedule muit^>le tasks per proccssOT at each round by are ready at step "i'* are scheduled, for some "k^p**. In other 

perfonning a parallel DFT whose scheduling width is larger words, for all ready nodes **u*' and "v", if precedes **v" 

than the number of processors. This increases the additive 60 in 'Ti". then either both are scheduled, neither are 

term in the time and space oonopleTUtxes, but ensures that the scheduled, or only **u" is sdieduled. Note that given a 

parallel work is within a constant factor of c^timal. We also l-travcrsaL the greedy •'p*'-travcrsal based on the 1-traversal 

show how a near-greedy scheduling algorithm using is uniquely defined. 

approximate prefix-sums can t>e used to in^vove the time An important **p''-traversal that we consider is the dq)th- 

complexity for the more powerful CRCW PRAM niodeL 65 first **p"-travasal. A depth-first '*p"-traversal (*'p'*-DFT) is a 

The theory underlying the invention is t>ased on compu- **p"-traversal based on a depth-first 1-traversal. An example 

tations as dynamically unfolding DAGs and uses terminol- is given in FIG. 11. In general, implementing a ^^p^-traversal 
based on a 1-traversal requires some method of determining the relative order in the 1-traversal among ready nodes at each step of the "p"-traversal.

FIG. 11 shows a "greedy" "p"-DFT of a DAG "G", for "p=3". On the left, nodes N02 of "G" are numbered in order of a 1-DFT of "G". On the right, "G" is labeled according to the greedy "p"-DFT, "T_p", based on the 1-DFT; "T_p = V_1, ..., V_7", where for "i = 1, ..., 7", "V_i", the set of nodes scheduled in step "i", is the set of nodes labeled "i" in the figure.

The following involves dynamically unfolding DAGs. We model a computation as a DAG that unfolds dynamically as the program executes on a given input. As in previous work (e.g. C. H. Papadimitriou and M. Yannakakis, Towards an architecture-independent analysis of parallel algorithms, In Proc. 20th ACM Symp. on Theory of Computing, pages 510-513, May 1988; R. D. Blumofe and C. E. Leiserson, Space-efficient scheduling of multithreaded computations, In Proc. 25th ACM Symp. on Theory of Computing, pages 362-371, May 1993; R. D. Blumofe and C. E. Leiserson, Scheduling multithreaded computations by work stealing, In Proc. 35th IEEE Symp. on Foundations of Computer Science, pages 356-368, November 1994), we assume the programs are deterministic, in the sense that the DAG for a computation does not depend on the order in which nodes are scheduled. There is a node in the DAG for each unit-work task in the computation, which we identify with the task. The edges represent any ordering dependencies between the tasks: if the program dictates that "u" must be executed before "v", then there is a path in the DAG from "u" to "v". Such ordering could be due to either data or control dependencies (e.g. "u" spawns "v", "u" writes a value that "v" reads, "v" executes conditionally depending on an outcome at "u", or "v" waits for "u" at a synchronization point). A node may have arbitrary indegree and outdegree. The program is assumed to define an ordering on the edges outgoing from a node. We note that our DAGs differ from dataflow graphs since a dataflow edge from a node "u" to "v" need not be included in the DAG if there is another path from "u" to "v"; the DAG strictly represents ordering constraints and not the flow of data.

The DAG is dynamically unfolding in the following sense: (i) when a node is scheduled, its outgoing edges are revealed; and (ii) when all its incoming edges are revealed, a node is revealed and is available for scheduling.

In an online scheduling algorithm for a dynamically unfolding DAG, the scheduling decision at each step is based on only the revealed nodes and edges of the DAG. Initially, only the root nodes are revealed, and the algorithm must detect when new nodes are revealed in order to schedule them.

The depth of the DAG is the parallel depth, "D", of the computation; the number of nodes in the DAG is the total number of unit-work tasks, "W", of the computation. Since the programs are deterministic, "D" and "W" are not affected by the traversal order. Note that a deterministic program may still be based on a randomized algorithm; in such cases the DAG may depend on the values of the random bits, which may be viewed as part of the input data for the program. Our results may be extended to nondeterministic programs, e.g. programs with race conditions, although then the bounds we obtain are based on worst case DAGs over all traversals.

For a space model, we consider two categories of space: (i) program variable space defined by tasks when they are scheduled, including space for task representations (the "stack frames") as well as any dynamic memory use, and (ii) task bookkeeping space used by the scheduling algorithm to keep track of nodes ready to be scheduled.

Program variable space is the memory to hold the input, the memory for stack frames, the memory explicitly allocated by program instructions, and the memory implicitly allocated to hold values computed by the program. The input space is assumed to reside in a preallocated block of memory; the remainder of the variable space is allocated by individual tasks. We assume that the amount of memory to be allocated by a task is independent of the traversal of the DAG. For deallocation, such an assumption is overly restrictive for languages that rely on garbage collection to automatically deallocate memory for values that are no longer needed. In particular, the memory for a value is ready to be deallocated as soon as the last task that references the value has completed. Thus certain deallocations are associated with a set of tasks, i.e. those tasks that reference the value, such that the last such task to be scheduled is credited for the deallocation.

At any point in the computation, the program variable space in use is the input space plus the sum of all the space allocated by scheduled tasks minus the sum of all the space deallocated by scheduled tasks. We can assign a weight, "w(u)", to each task "u" that represents the amount of space allocated by the task minus the amount deallocated. We assume that this weight is available to the scheduler prior to scheduling the node, or if we increase the depth, then we can know the weight and hold off on the allocation after the node is scheduled once. For a prefix "T = V_1, ..., V_j" of a "p"-traversal, "p >= 1", we define "Space(T)", the program variable space in use after "T", to be "Space(T) = n + \sum_{i=1}^{j} \sum_{u \in V_i} w(u)", where "n" is the amount of space needed to hold the input. This definition assumes a common pool of memory so that any deallocated space can be re-used by subsequent allocations. Moreover, by considering only the net effect on the space of all tasks scheduled at a step, it ignores the fluctuations in memory use during the step as tasks allocate and deallocate. Such fluctuations can be addressed by splitting each node into two, one that allocates and one that deallocates, if desired.

The space complexity or maximum space of a "p"-traversal, "T_p = V_1, ..., V_tau", is defined as "S_p = \max_{j=1,...,tau} Space(V_1, ..., V_j)", i.e. the maximum space in use after any step of the traversal.

For task bookkeeping, we assume that a node in the DAG that is identified with a task "u" is of constant size, and also that each edge is of constant size. Consider the sequence, "S", of (revealed) edges outgoing from a node. Any consecutive subsequence of "S" can be represented compactly in constant space by storing the first and last edge in the subsequence. However, an edge must be allocated constant storage of its own before it can be used to identify the node at its other endpoint. Bookkeeping space for a node must be allocated before the node can be scheduled, and can be deallocated after it is scheduled. Note that although the scheduling algorithm can base its decisions on all revealed nodes and edges of the DAG at a step, it need not store all these nodes and edges. In fact, it need only store sufficiently many nodes and edges to be able to identify and schedule ready nodes within the desired resource bounds for the scheduling algorithm.

The assumptions made in defining our computation model with its space model are reasonable for most fine-grained languages. For example, the model accurately reflects the execution of a NESL program.
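The traversal and space definitions above can be made concrete with a small sketch. It is an offline illustration only, assuming the whole DAG is known in advance (the scheduler described in this document works online): it computes a 1-DFT, derives the greedy "p"-traversal based on it by always taking the ready nodes with the lowest 1-DFT numbers, and evaluates "Space(T)" after every step. The function names, the example DAG, the weights "w(u)" and the input size "n" are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, List

Dag = Dict[str, List[str]]  # node -> children in left-to-right order


def indegrees(children: Dag) -> Dict[str, int]:
    """Count the parents of every node."""
    indeg = {v: 0 for v in children}
    for kids in children.values():
        for c in kids:
            indeg[c] += 1
    return indeg


def one_dft(children: Dag, root: str) -> List[str]:
    """Depth-first 1-traversal: pop the top ready node, push newly ready nodes."""
    waiting = indegrees(children)
    stack, order = [root], []
    while stack:
        u = stack.pop()
        order.append(u)
        newly_ready = []
        for c in children[u]:
            waiting[c] -= 1
            if waiting[c] == 0:
                newly_ready.append(c)
        # Push in reverse so the leftmost newly ready child ends up on top.
        stack.extend(reversed(newly_ready))
    return order


def greedy_p_traversal(children: Dag, root: str, p: int) -> List[List[str]]:
    """Greedy p-traversal based on the 1-DFT: each step schedules the
    (at most p) ready nodes with the lowest 1-DFT numbers."""
    rank = {v: i for i, v in enumerate(one_dft(children, root))}
    waiting = indegrees(children)
    ready, steps = [root], []
    while ready:
        ready.sort(key=lambda v: rank[v])
        step, ready = ready[:p], ready[p:]
        steps.append(step)
        for u in step:
            for c in children[u]:
                waiting[c] -= 1
                if waiting[c] == 0:
                    ready.append(c)
    return steps


def space_profile(steps: List[List[str]], w: Dict[str, int], n: int) -> List[int]:
    """Space(V_1, ..., V_j) for every prefix j: n plus the net weights so far."""
    in_use, profile = n, []
    for step in steps:
        in_use += sum(w[u] for u in step)
        profile.append(in_use)
    return profile


# Hypothetical series-parallel DAG: source "s" forks three branches that
# rejoin at sink "t"; branch "a" itself forks "a1", "a2" and joins at "aj".
EXAMPLE_DAG: Dag = {
    "s": ["a", "b", "c"], "a": ["a1", "a2"], "a1": ["aj"], "a2": ["aj"],
    "aj": ["t"], "b": ["t"], "c": ["t"], "t": [],
}
# Hypothetical net allocation w(u) per task; negative means net deallocation.
EXAMPLE_W = {"s": 1, "a": 1, "a1": 2, "a2": 2, "aj": -4, "b": 3, "c": 3, "t": -7}
EXAMPLE_N = 5  # hypothetical input size n

for p in (1, 2):
    steps = greedy_p_traversal(EXAMPLE_DAG, "s", p)
    print(p, steps, max(space_profile(steps, EXAMPLE_W, EXAMPLE_N)))
```

With these made-up weights the sequential traversal peaks at 13 units and the 2-DFT at 14, a small instance of the point that a parallel traversal can hold more space live than the sequential one while staying close to it.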
Handling large allocations: Our approach is to treat each node that allocates "k" memory as "k/m" dummy nodes, the last of which performs the actual allocation of size "k".

Any greedy "p"-traversal based on a 1-traversal is efficient in both space and number of steps. Implementing such "p"-traversals requires some method of determining the relative order in the 1-traversal among the ready nodes at each step, as well as techniques for allocating the scheduled tasks to processors, and for identifying ready nodes. In this section, we present an algorithm for fast implementation of a particular greedy "p"-traversal, the depth-first "p"-traversal. Given the results in the previous section, the "p"-DFT is perhaps the most interesting traversal to consider, since it enables direct comparisons between the space used by a parallel traversal and the space used by the standard sequential traversal.

Our scheduling algorithm applies to dynamically unfolding series-parallel DAGs. Such DAGs arise naturally from languages with nested fork-and-join style parallelism; this includes nearly all the data-parallel languages (both nested and non-nested), as well as many others. A source node in these DAGs corresponds to a task that forks or spawns child tasks; a sink node corresponds to the rejoining of these parallel threads of control. Each source node may spawn an arbitrary number of tasks on each step. This allows for shallower DAGs than if we restricted source nodes to binary fanout; however, it complicates both the scheduling of tasks and their synchronization. Data dependencies are between nodes ordered by paths in the series-parallel DAG.

Our computation model assumes the outgoing edges from a node are ordered, and we assume here that the standard 1-DFT uses this "left-to-right" order. We also assume that each child of a node has its index among its siblings, corresponding to this left-to-right ordering. Finally, we assume that the number of outgoing edges from a node is revealed when a node is scheduled, even before any space has been allocated for the edges. Our scheduling algorithm performs a constant number of EREW PRAM (see the aforementioned article by JaJa) steps and a constant number of parallel prefix-sums computations (see R. E. Ladner and M. J. Fischer, Parallel prefix computation, Journal of the ACM, 27:831-838, 1980) for each round of scheduling.

A stack-based scheduling algorithm will use the following property of 1-DFTs on series-parallel DAGs. Consider a 1-DFT of a series-parallel DAG "G", and let "u" and "v" be unordered nodes in "G" such that "u" has a lower DFT number than "v". Then the DFT visits any descendant of "u" that is not a descendant of "v" prior to visiting "v" or any descendant of "v".

The proof follows from the following observation, applied to the lowest common source node of "u" and "v": Let "w" be a source node in "G" with "k >= 1" children, "c_1, ..., c_k", in 1-DFT order, and let "w'" be its associated sink node. For "i = 1, ..., k", let "G_i" be the subgraph of "G" consisting of "c_i" and all nodes that are both descendants of "c_i" and ancestors of "w'". Then the following is a consecutive subsequence of the 1-DFT: node "w", then all of "G_1", then all of "G_2", and so on, until finally all of "G_k", followed by node "w'".

The Ready Nodes algorithm. We assume for now that we can identify when a node becomes ready, i.e. when its last parent is scheduled. Consider a dynamically unfolding series-parallel DAG "G". Let "R" be an array initially containing the root node of "G". Repeat the following two steps until all nodes in "G" have been scheduled:

(1) Schedule the first "min{p, |R|}" nodes from "R", with the "i"th node in "R" assigned to processor "i".

(2) Replace each newly scheduled node by its ready children, in left-to-right order, in place in the array "R".

The Ready Nodes algorithm above produces the greedy "p"-DFT based on the 1-DFT of "G". We show by induction on the steps of the "p"-DFT the following invariants: "R" contains precisely the set of ready nodes, the nodes in "R" are ordered lowest to highest by their 1-DFT numbers, and the scheduled nodes are a prefix of a greedy "p"-DFT of "G". Initially, the root node is the only ready node, so the invariants hold for the base case. Assume the invariants hold just prior to a step "t >= 1". We show that they hold after step "t". First, by the invariants, "R" contains the ready nodes just prior to step "t" and the nodes are in 1-DFT order, so the algorithm schedules the ready nodes with the lowest DFT numbers. Second, at the end of step "t", "R" contains precisely the ready nodes, since scheduled nodes are removed from "R" and any newly ready node must have a parent scheduled this step, and hence will be added to "R". Third, at the end of step "t", the nodes in "R" are in 1-DFT order. To see this, observe that nodes in "R" are unordered in "G". Hence by the aforementioned property stated relative to the series-parallel DAGs, the left-to-right ordered children that replace a node, "u", will have lower DFT numbers than any node, "v", to the right of "u" in "R" just prior to step "t", or any children of "v". It follows by induction that the algorithm produces a greedy "p"-DFT of "G".

The following involves the P-Ready Nodes algorithm. The latter, according to this embodiment, stores, for each ready node, only one of its parent nodes. Define the last parent of a node, "v", to be the leftmost parent node of "v" that is scheduled in the step that "v" becomes ready. Note that if "v" is a source node, it has only one parent, so distinguishing such a parent node is necessary only for sink nodes. To bound the resource requirements, we use lazy allocation, in which ready nodes are incorporated in the data structure only when they are to be scheduled in the following step. In the remainder of this section, we discuss the algorithm and data structures in more detail.

The P-Ready array: The main component of the data structure is an array, "Frontier", which holds the last parent for each ready node. Specifically, there are two types of nodes in "Frontier": (i) scheduled source nodes with at least one unscheduled source child; observe the children of such scheduled nodes will be ready; and (ii) scheduled sink nodes that are the last parent of an unscheduled, ready child. For each node "v" in "Frontier", we keep track of the number of its (unscheduled) children "c(v)". (At each step there may be at most one source node for which only some of its children are scheduled.) As an invariant we maintain that nodes are represented in the array "P-Ready" in the order of their 1-DFT numbers. The size of array "P-Ready" can be bounded by the space requirement of the "p"-DFT traversal.

The following steps serve for processor allocation: Compute a prefix-sums computation on the "c(v)" values for nodes "v" that are represented in the first "p" entries of array "P-Ready". Let the output sequence be in array "C". Let "i'" satisfy "C[i'-1] < p <= C[i']" (for simplicity, assume that "p = C[i']"). The children of the first "i'" nodes are to be scheduled. They are inserted in order into an auxiliary array "Active" of size "p": For "i = 1, ..., i'", the representations of the children of node "i" (in "P-Ready") are placed in order in entries "C[i-1]+1" through "C[i]" of array "Active". Processor "j" is allocated to a node represented in "Active[j]", and visits the node; for each node "v", the number of its
children, "c(v)", is now revealed. For each sink node, "u", in "Active", if it is not the last parent of its child, set "c(u)=0"; such items are marked for removal. Use a prefix-sums computation to compact the items in "Active" that are not marked for removal. The first "i'" entries from array "P-Ready" are cleared, and the contents of "Active" are prepended (in order) to array "P-Ready".

The PDFT P-Ready lemma: The P-Ready Nodes algorithm above produces the greedy "p"-DFT based on the 1-DFT of "G". Its implementation takes a constant number of "p"-processor EREW PRAM operations, plus a constant number of prefix-sums computations of size "p", per step of the "p"-DFT.

Task synchronization: Identifying ready nodes. To complete the implementation, we show how we identify when a node becomes ready. Consider a set of "n" child tasks that have been spawned in parallel by a parent task (a source node in our DAG with a fanout of "n"). The task-synchronization problem is to quickly detect when the last child completes so that we can restart the computation of the parent (i.e. start the sink task corresponding to the source task). Since the computation is dynamic, we do not know ahead of time which child will finish last or how long any of the child computations will take. Furthermore we cannot afford to keep the parent active since this could lead to work-inefficiency (remember that such spawning can be nested). One way to implement task-synchronization is to associate a counter with the parent which is initialized to "n" and decremented by each child as it completes. Since multiple children can complete simultaneously, however, this requires a fetch-and-add operation which is expensive both in theory and in practice (especially for an EREW PRAM). A second choice is to build a binary tree when the tasks are spawned which will be used to synchronize as they complete. This, however, requires an "O(log n)" slowdown to go up the tree when synchronizing, and unless dynamic load balancing is used, will also require extra work. In particular this implementation loses the advantage of allowing for arbitrary fanout: the simulation costs equal the extra depth in the DAG required by binary fanout.

Description of the algorithm and data structures: To avoid the problems mentioned above, the implementation is based on the following points:

(1) We generate a coordination list among the "n" children when they are spawned.

(2) As each child finishes it removes itself from the list by short-cutting between its two neighbors. If neither neighbor is finishing on the same step, the short-cutting takes constant time.

(3) If multiple adjacent neighbors finish, we use a prefix-sums computation to shortcut over all completing neighbors. To make this possible we use properties of the DFT to show that all neighbors that are completing will be adjacent in the task array. Note that neighbors that are not completing might not be in the task array at all since they might have spawned children and be asleep.

(4) When the last child finishes, it reactivates the parent. If multiple finish simultaneously, then the leftmost reactivates the parent.

Building sink pointers. When a source node "v" is scheduled (through array "Active"), a representative for its associated sink node, "sinkv", is created, and inserted into a set "Sink". (A source node that is also a sink will be considered for this matter as a pair of source-sink nodes.) The source node will keep a pointer to this representative. When the children of a source node are scheduled, each of them will similarly create a representative for their associated sink nodes. Each child "u", upon scheduling, copies the pointer to "sinkv" from its parent "v", and sets "sinku" to point to "sinkv". Note that after the children of "v" are scheduled, node "v" is removed from the data structure. A node will be kept in the set "Sink" until it becomes ready for scheduling, or otherwise dismissed, as will be noted below.

The Coordination linked list. When the children of a source node "v" are scheduled, we create a linked list among their associated sink nodes, in order. The invariant we maintain is that the list links precisely the nodes from the original list which are still in the data structure; i.e., nodes that are either in the set "Sink" or in the array "P-Ready". When the list becomes empty, the sink node "sinkv" will be ready for scheduling. Therefore, when the header of the list (at that particular time) is scheduled, it checks whether the list becomes empty. If it does, then it remains in array "Active", and hence is placed in "P-Ready". Otherwise, the next unscheduled node in the list will become the header of the list, and the scheduled header is removed from the data structure. Note that each node from the list that is scheduled and is not the list header is promptly removed from the data structure.

Maintaining the Coordination list: The Coordination list must be maintained under deletions. The possible difficulty is that a sequence of consecutive nodes in the list can be deleted at the same time; an efficient (yet slow) implementation using standard techniques may be quite involved. We present a fast and simple solution to the problem, utilizing the data structure. The key observation is that if a sequence of two or more adjacent sibling nodes is deleted, then their representatives reside in consecutive entries of array "Active"; hence, updating the Coordination list for this deleted sublist is as easy as chaining the first and last entries in the subarray of "Active" containing these representatives. Such computations can be obtained using a chaining computation for the nodes represented in "Active" and can be done, e.g., by prefix-sums. To see that the observation is correct, we note the following: (i) a property of 1-DFT numbers in series-parallel DAGs: given two nodes, "u" and "v", that share the same sink child "s", and a node "w" whose 1-DFT number is between that of "u" and "v", then "w" must be an ancestor of a node "v'" that shares the same sink child "s", such that the 1-DFT number of "v'" is larger than that of "u" and smaller than or equal to that of "v" (i.e. "v'" may be "v"); (ii) nodes are put in their Coordination list in the order of their 1-DFT numbers; (iii) nodes in "Active" are ordered by their 1-DFT numbers; (iv) a node can only be deleted if it is in array "Active"; and (v) after a node is put in array "Active", its ancestors cannot be put there. Now, let "u" and "v" be adjacent nodes in the Coordination list, "u" prior to "v", such that both are deleted. By (iv), both should be in array "Active". For every node "w" between "u" and "v" in the Coordination list, since "w" was already deleted, then by (iv) and (v), no ancestor of "w" can be in "Active". Similarly, since "v" is in "Active", none of its ancestors can be in "Active". The observation follows by (i), (ii) and (iii).

Complexity and extensions. Each step of the "p"-DFT involves at most "p" unit-time tasks. We say a nested parallelism computation uses concurrent access primitives if two or more nodes that are unordered by its DAG read or write the same program variable; if there are no such nodes, the computation uses exclusive access primitives. The operations on the data structures in the scheduling algorithm
described above for one step of a "p"-DFT can be implemented by using a constant number of steps on a "p"-processor EREW PRAM plus a constant number of applications of a prefix-sums computation of size "p". We obtain an optimal work implementation on "p" processors by using a "(p log p)"-DFT, thereby amortizing the overhead for the resource allocation at each step (for simplicity, we state the bounds for the case where the sequential space is at worst linear in the sequential running time):

Theorem of exclusive access implementation. Consider a nested parallelism computation with work "W", depth "D", and sequential space "S_1 in O(W)", that uses (only) exclusive access primitives. The above scheduling algorithm can implement the computation in "O(W/p + D log p)" time and "O(S_1 + D p log p)" space on a "p"-processor EREW PRAM, or within the same bounds, with high probability, on a "p"-processor hypercube.

Proof: By the aforementioned handling of large allocations (with constant "m"), there are "O(W/(p log p) + D)" steps in the "(p log p)"-DFT we use. Each of these steps can be shown to take "O(log p)" time on "p" processors, as follows: A prefix-sums computation of size "p log p" can be implemented on a "p"-processor EREW PRAM or hypercube in "O(log p)" time (see the Ladner and Fischer article). Using random hashing techniques, the shared memory of a "(p log p)"-processor EREW PRAM can be placed on a "p"-processor hypercube so that each step of a "(p log p)"-processor EREW PRAM can be implemented on the hypercube in "O(log p)" time with high probability. (See L. G. Valiant, General purpose parallel architectures, in J. van Leeuwen, editor, Handbook of Theoretical Computer Science, Volume A, pages 943-972, Elsevier Science Publishers B.V., Amsterdam, The Netherlands, 1990.) Thus the scheduling can be done in "O(log p)" time. Likewise, the "p log p" unit-work tasks can be performed in "O(log p)" time.

Faster implementation on the CRCW PRAM. A faster execution can be obtained on the CRCW PRAM, by replacing each prefix-sums computation by either an approximate prefix-sums computation or by a chaining computation. Algorithms for approximate prefix-sums and for chaining are known to take "O(t_aps)", where "t_aps = loglog p" in the worst case (see Berkman and U. Vishkin, Recursive star-tree parallel data structure, SIAM Journal on Computing, 22(2):221-242, 1993; see also T. Goldberg and U. Zwick, Optimal deterministic approximate parallel prefix sums and their applications, In Proc. 3rd Israel Symp. on Theory of Computing and Systems, pages 220-228, January 1995) and "t_aps = log* p" with high probability (see M. T. Goodrich, Y. Matias, and U. Vishkin, Optimal parallel approximation for prefix sums and integer sorting, In Proc. 5th ACM-SIAM Symp. on Discrete Algorithms, pages 241-250, January 1994; P. L. Ragde, The parallel simplicity of compaction and chaining, Journal of Algorithms, 14:371-380, 1993; and the aforementioned Berkman and Vishkin 1993 articles). In order to use the approximate version of the prefix-sums computation, we must allow for a small fraction of null cells in arrays "P-Ready" and "Active", and allow for a little less than "p" to be scheduled at each step even if "p" are available (as was already allowed to handle large allocations).

The Theorem of concurrent access implementation. Consider a nested parallelism computation with work "W", depth "D", and sequential space "S_1 in O(W)", that may use concurrent access primitives. The above scheduling algorithm can implement the computation in "O(W/p + D t_aps)" time and "O(S_1 + D p t_aps)" space on a "p"-processor CRCW PRAM, where "t_aps" is "O(loglog p)" deterministically or "O(log* p)" with high probability.

Memory allocation procedures. The space bounds in the previous theorems account for the absolute number of memory cells used, without addressing the issue of explicit memory allocation for the data structures and for the program variables declared during the execution. Memory allocation for the array data structures is straightforward. Memory allocation for the set "Sink" data structure, as well as for program variables, can be done using a dynamic dictionary data structure. An adaptive allocation and deallocation of space, so as to maintain at each step space linear in the number of representatives in the "Sink" data structure, or in the number of program variables, can be implemented with "p" processors and logarithmic time on the EREW PRAM (W. J. Paul, U. Vishkin, and H. Wagener, Parallel dictionaries on 2-3 trees, In Proc. 10th Int. Colloquium on Automata, Languages and Programming, Springer LNCS 154, pages 597-609, 1983) and with linear work, with high probability, on a CRCW PRAM (see the aforementioned Gil, Matias, Vishkin 1991 article).

These automatic memory allocation procedures are used in conjunction with the above scheduling algorithm to obtain time-, work-, and space-efficient execution of programs written in languages with fine-grained nested parallelism.

We derived space and step bounds for executing a general class of parallel computations. For a more restricted class of nested-parallel computations we described a scheduling algorithm and derived time bounds that include the scheduling cost. For computations with sufficient parallelism, the space bounds according to an embodiment of the invention significantly improve previously known bounds.

The invention has the advantage of generating space-efficient implementations of parallel languages, and in particular the NESL language.

Tasks may be newly spawned at any stage of the process. Where such tasks are spawned, they may be fed back and the component assigning such tasks may assign priorities.

While embodiments of the invention have been described in detail, it will be evident to those skilled in the art that the invention may be embodied otherwise without departing from its spirit and scope.

What is claimed is:

1. A method of parallel processing, comprising:

determining a sequential ordering of tasks for processing,

assigning priorities to available tasks on the basis of the earliest and then later in the sequential ordering;

selecting a number of tasks greater than a total number of available parallel processing elements from all available tasks having the highest priorities;

partitioning the selected tasks into a number of groups equal to the available number of parallel processing elements; and

executing the tasks in the groups in the parallel processing elements;

said determining step establishing an ordering with a specific predetermined sequential schedule that is independent of the parallel execution, and said assigning step assigns priorities for parallel execution on the basis of the sequential schedule that is independent of the parallel execution.

2. A method as in claim 1, wherein the number of selected tasks differs from the number of available tasks and differs from the available number of parallel processing elements.

3. A method as in claim 1, wherein the number of selected tasks is more than the available number of processors and less than the maximum number of tasks possible.
4. A method as in claim 1, wherein the number of selected
tasks is N and the number of available parallel processing
elements is p, and the step of partitioning partitions the N
selected tasks into p groups of substantially N/p size.
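
By way of illustration only, the partitioning recited in claim 4 can be sketched as follows. The function name `partition_evenly` and the choice of contiguous chunks are assumptions made for this sketch; the claim itself requires only p groups of substantially N/p size.

```python
def partition_evenly(selected_tasks, p):
    """Split the N selected tasks into p groups whose sizes differ by at most one."""
    if p <= 0:
        raise ValueError("p must be positive")
    n = len(selected_tasks)
    base, extra = divmod(n, p)  # the first 'extra' groups receive one more task
    groups = []
    start = 0
    for i in range(p):
        size = base + (1 if i < extra else 0)
        groups.append(selected_tasks[start:start + size])
        start += size
    return groups
```

Each resulting group could then be delegated to one processing element, as in claim 5.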

5. A method as in claim 1, wherein the step of partitioning
includes a step of delegating each group to one of the
parallel processing elements.

6. A method as in claim 1, wherein the step of assigning
includes assigning in a number of parallel steps.

7. A method as in claim 1, wherein spawned steps are
placed in the ordering of parent tasks that spawned them.

8. A method as in claim 1, wherein said step of partition- 
ing is performed in a number of parallel steps. 

9. A method as in claim 1, wherein said sequential
ordering is a Depth-first Traversal (DFT) schedule.

10. A method as in claim 1, wherein the step of processing
spawns a number of tasks and the step of assigning priorities
includes assigning priorities to tasks spawned.

11. A method as in claim 10, wherein the number of
selected tasks is N and the number of available parallel
processing elements is p, and the step of partitioning
partitions the N selected tasks into p groups of substantially
N/p size.

12. A method as in claim 1, wherein the step of selecting
includes placing a limit L on the number N of selected tasks
from among a number M of available tasks, and if the
available tasks M are equal or greater than the number L then
N=L, and if M<L, N=M.
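
The selection rule of claim 12 can be illustrated with a short sketch. The name `select_tasks`, the representation of tasks as (priority, task) pairs, and the convention that a smaller number means a higher priority are all assumptions of this sketch, not part of the claims.

```python
import heapq


def select_tasks(available, limit):
    """Return the N highest-priority tasks, where N = L if M >= L, else N = M.

    `available` holds (priority, task) pairs; a smaller priority number is
    treated as a higher priority (an assumption of this sketch).
    """
    m = len(available)
    n = limit if m >= limit else m
    # nsmallest returns the n pairs with the smallest priority numbers, in order.
    return heapq.nsmallest(n, available, key=lambda pair: pair[0])
```

A caller following claim 1 would presumably choose the limit L larger than the number of processing elements, so that more tasks are selected than there are elements whenever enough tasks are available.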

13. A method as in claim 1, wherein the step of
partitioning includes weighting the tasks and the step of partitioning
includes dividing the tasks into the groups on the basis of the
weighting.
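
One possible way to divide tasks on the basis of weights, offered only as a hedged sketch, is to cut the priority-ordered task list wherever the running sum of weights crosses a multiple of total/p. The name `partition_by_weight` and this particular prefix-sum policy are assumptions; the claim does not prescribe how the weighting is applied.

```python
def partition_by_weight(tasks, weights, p):
    """Place tasks, in order, into p groups of roughly equal total weight."""
    total = sum(weights)
    if p <= 0 or total <= 0:
        raise ValueError("p and the total weight must be positive")
    groups = [[] for _ in range(p)]
    prefix = 0  # sum of the weights of all tasks placed so far
    for task, weight in zip(tasks, weights):
        # The group is chosen from the weight that precedes this task.
        index = min(p - 1, int(prefix * p // total))
        groups[index].append(task)
        prefix += weight
    return groups
```

Because the groups are contiguous runs of the input, the priority order of the selected tasks is preserved inside each group.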

14. A method as in claim 1, wherein the step of
determining applies to each task of a program a designation that
identifies the ordering of the task in the sequence.

15. A method as in claim 1, wherein the step of assigning
priorities is on the basis of high to low priorities for first to
last in the sequential ordering such that tasks are entered in
the parallel processing elements on the basis of high to low
priorities for the first to last in the sequential ordering.
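
The relationship between a depth-first sequential ordering (claim 9), the per-task designation (claim 14), and high-to-low priorities (claim 15) can be sketched as below. This is a minimal illustration that assumes the tasks form a spawn tree given as a hypothetical `children` mapping from a parent task to the tasks it spawns; the name `dft_priorities` is likewise an assumption.

```python
def dft_priorities(children, root):
    """Number tasks in depth-first (preorder) sequence; 0 is the highest priority."""
    order = {}
    stack = [root]
    while stack:
        task = stack.pop()
        order[task] = len(order)  # designation: position in the sequential schedule
        # Push children in reverse so that the first child is visited first.
        stack.extend(reversed(children.get(task, [])))
    return order
```

Sorting the available tasks with `sorted(ready, key=order.get)` would then list them from first to last in the sequential ordering, that is, from high to low priority.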

16. An apparatus for parallel processing, comprising: 

a task sequential-ordering preprocessor for sequential
ordering of tasks for processing;

a task priority-assigning assignment manager responsive 
to the sequential ordering; 

a plurality of available parallel processing elements;

means for selecting a number of tasks greater than a total
number of available parallel processing elements from
all available tasks having the highest priorities;

means for partitioning the selected tasks into a number of
groups equal to the available number of parallel
processing elements; and

means for entering the tasks in the groups in the parallel 
processing elements; 

said preprocessor including a sequential schedule that
establishes a predetermined ordering that is independent
of the parallel execution so that priorities for
parallel execution occur on the basis of sequential
scheduling that is independent of the parallel execution.

17. An apparatus as in claim 16, wherein the number of
selected tasks differs from the number of available tasks and
differs from the available number of parallel processing
elements.

18. An apparatus as in claim 16, wherein the number of
selected tasks is more than the available number of
processors and less than the maximum number of tasks possible.

19. An apparatus as in claim 16, wherein the number of
selected tasks is N and the number of available parallel
processing elements is p, and the means for partitioning
partitions the N selected tasks into p groups of substantially
N/p size.

20. An apparatus as in claim 16, wherein the means for
partitioning includes a means for delegating each group to
one of the parallel processing elements.

21. An apparatus as in claim 16, wherein the assignment
manager assigns in a number of parallel steps.

22. An apparatus as in claim 16, wherein said assignment
manager places spawned steps in the ordering of parent tasks
that spawned them.

23. An apparatus as in claim 16, wherein said means for
partitioning is performed in a number of parallel steps.

24. An apparatus as in claim 16, wherein said sequential
ordering processor is a Depth-first Traversal (DFT)
preprocessor.

25. An apparatus as in claim 16, wherein the preprocessor
spawns a number of tasks and the assignment manager
includes assigning priorities to tasks spawned.

26. An apparatus as in claim 25, wherein the number of
selected tasks is N and the number of available parallel
processing elements is p, and the means for partitioning
partitions the N selected tasks into p groups of substantially
N/p size.

27. An apparatus as in claim 16, wherein the means for
selecting includes placing a limit L on the number N of
selected tasks from among a number M of available tasks,
and if the available tasks M are equal or greater than the
number L then N=L, and if M<L, N=M.

28. A method as in claim 16, wherein the means for
partitioning includes means for weighting the tasks and the
means for partitioning includes means for dividing the tasks
into the groups on the basis of the weighting.

29. A system as in claim 16, wherein the preprocessor 
applies to each task of a program a designation that identifies 
the ordering of the task in the sequence. 

30. A system as in claim 16, wherein the assignment
manager assigns priorities on the basis of high to low
priorities for first to last in the sequential ordering such that
said means for entering enters tasks in the parallel
processing elements on the basis of high to low priorities for first to
last in the sequential ordering.

* * * * *